1 Simple Example beforehand: T-shirts’ Size

At the first, they don’t have these lables, they only have some information about customers. Let’s think about how tailer make your customer-tailored T-shirt. They may measure your neck width(collar), arm length, chest width, waistline and so on. But they don’t have any lable to make as fewer as possible sizes so that they can save cost. Let’s say they only want to have five moduel. So the problem is how to find these five sizes so that most of the customers can buy a comfortable one, and meanwhile, when they have the right size, the T-shirt is not too large or to small. In statistics, this problem is equivalent to find five clusters based on provided information so that the variation within clusters is small, and between clusters variation is large.

Add more visual graphs to illustrate the clustering.

go to top

2 Summary of Seeds data

This tutorial illustrates the implentations of clustering and association rule mining in (R).

We use the seeds data set to demonstrate cluster analysis in R. The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each. A description of the dataset can be viewed at (https://archive.ics.uci.edu/ml/datasets/seeds)

seed = read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed = seed[,1:7]
colnames(seed) = c("area", "perimeter","campactness", "length", "width", "asymmetry", "groovelength")

Scale data to have zero mean and unit variance for each column

seed <- scale(seed) 

go to top

3 K-means

Use k-means method for clustering and plot results.

# K-Means Cluster Analysis
fit <- kmeans(seed, 5) #5 cluster solution
#Display number of clusters in each cluster
table(fit$cluster)
## 
##  1  2  3  4  5 
## 51 40 45 56 18
#Plot cluster in kmeans
install.packages("fpc")
library(fpc)
plotcluster(seed, fit$cluster)

#See exactly which item are in 1st group
seed[fit$cluster==1,]
#get cluster means for scaled data
aggregate(seed,by=list(fit$cluster),FUN=mean)
##   Group.1       area  perimeter campactness     length      width
## 1       1 -1.1092967 -1.0549469  -1.3160126 -0.8935647 -1.2531431
## 2       2 -0.5016064 -0.5683837   0.3844330 -0.6850222 -0.3143374
## 3       3  0.2466954  0.2704211   0.3439821  0.2325225  0.3085097
## 4       4  1.3885128  1.3867754   0.5989659  1.3721886  1.2746370
## 5       5 -0.6788679 -0.7383738   0.1510023 -0.7962992 -0.4877120
##    asymmetry groovelength
## 1  0.6425384  -0.55553010
## 2 -0.8757721  -0.98640013
## 3 -0.3995553   0.04190449
## 4 -0.0910015   1.39643675
## 5  1.4076390  -0.68322884
#or alternatively, use the output of kmeans
fit$centers
##         area  perimeter campactness     length      width  asymmetry
## 1 -1.1092967 -1.0549469  -1.3160126 -0.8935647 -1.2531431  0.6425384
## 2 -0.5016064 -0.5683837   0.3844330 -0.6850222 -0.3143374 -0.8757721
## 3  0.2466954  0.2704211   0.3439821  0.2325225  0.3085097 -0.3995553
## 4  1.3885128  1.3867754   0.5989659  1.3721886  1.2746370 -0.0910015
## 5 -0.6788679 -0.7383738   0.1510023 -0.7962992 -0.4877120  1.4076390
##   groovelength
## 1  -0.55553010
## 2  -0.98640013
## 3   0.04190449
## 4   1.39643675
## 5  -0.68322884

3.1 Determine number of clusters

Here a simple within group sum of squares method is used.

# Determine number of clusters
wss <- (nrow(seed)-1)*sum(apply(seed,2,var))
for (i in 2:12) wss[i] <- sum(kmeans(seed,
                                     centers=i)$withinss)
plot(1:12, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

Prediction strength for estimating number of clusters. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend to choose as optimal number of cluster the largest number of clusters that leads to a prediction strength above 0.8 or 0.9.

prediction.strength(seed, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength 
## Clustering method:  kmeans 
## Maximum number of clusters:  15 
## Resampled data sets:  10 
## Mean pred.str. for numbers of clusters:  1 0.8792773 0.7846101 0.5229052 0.411331 0.3615098 0.3258512 0.2736393 0.2739454 0.2085326 0.197504 0.1630583 0.1592821 0.1131956 0.1276479 
## Cutoff value:  0.8 
## Largest number of clusters better than cutoff:  2

fpc package has cluster.stat() function that can calcuate other cluster validity measures such as Average Silhouette Coefficient (between -1 and 1, the higher the better), or Dunn index (betwen 0 and infinity, the higher the better):

d = dist(seed, method = "euclidean")
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15){
  cluster_result = kmeans(seed, i)
  clusterstat=cluster.stats(d, cluster_result$cluster)
  result[i-1,1]=i
  result[i-1,2]=clusterstat$avg.silwidth
  result[i-1,3]=clusterstat$dunn   
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters')

plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters')

The package NbClust provides 30 indexes for determining the optimal number of clusters in a data set.

For more sophiscated methods, see for example blog, or course notes.

This article on Cross Validated provides a great illustration of the situations when k-means would fail.

go to top

4 Hierarchical clustering

#Wards Method or Hierarchical clustering
#Calculate the distance matrix
seed.dist=dist(seed)
#Obtain clusters using the Wards method
seed.hclust=hclust(seed.dist, method="ward")
plot(seed.hclust)

#Cut dendrogram at the 3 clusters level and obtain cluster membership
seed.3clust = cutree(seed.hclust,k=3)
#See exactly which item are in third group
seed[seed.3clust==3,]
#get cluster means for raw data
#Centroid Plot against 1st 2 discriminant functions
#Load the fpc library needed for plotcluster function
library(fpc)
#plotcluster(ZooFood, fit$cluster)
plotcluster(seed, seed.3clust)

go to top

5 (Optional) Model-Based Cluster Analysis

A newer clustering appraoch, model-based cluster, treats the clustering problem as maximizing a Normal mixture model. Generating an observation in this model consists of first picking a centroid (mean of a multivariate normal distribution) at random and then adding some noise (variances). If the noise is normally distributed, this procedure will result in clusters of spherical shape. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. The model that we recover from the data then defines clusters and an assignment of documents to clusters. It can be thought as a generalization of \(K\)-means.

The model “recovering” process is done via Expectation-Maximization(EM) algorithm. It is an iterative approach to maximize the likelihood of a statistical model when the model contains unobserved variables.

One obvious advantage of the approach is that we can treat the question “How Many Clusters?” as a model selection problem.

For detailed description of the method and the package, see 1 and 2

install.packages('mclust')
library(mclust)
mclust_result = Mclust(seed)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
## 
##  log.likelihood   n  df      BIC      ICL
##        416.1639 210 122 179.9808 172.0903
## 
## Clustering table:
##  1  2  3  4 
## 44 50 67 49

The BIC used in the package is the negative of the ‘usual’ BIC when we discussed regression models. Therefore we are trying to maximize the BIC here.

plot(mclust_result)

go to top

6 Association Rules

Association Rules is a popular and well researched method for discovering interesting relations between variables in large databases. arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.

The Groceries data set contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.

library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55 
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

summary() displays the most frequent items in the data set, information about the transaction length distribution and that the data set contains some extended transaction information. We see that the data set contains transaction IDs. This additional information can be used for analyzing the data set.

To find the very long transactions we can use the size() and select very long transactions (containing more than 25 items).

#
x = Groceries[size(Groceries) > 25]
inspect(x)
##     items                     
## [1] {sausage,                 
##      finished products,       
##      tropical fruit,          
##      pip fruit,               
##      other vegetables,        
##      butter,                  
##      curd,                    
##      dessert,                 
##      butter milk,             
##      yogurt,                  
##      whipped/sour cream,      
##      soft cheese,             
##      frozen vegetables,       
##      rolls/buns,              
##      white bread,             
##      brown bread,             
##      pastry,                  
##      pasta,                   
##      soups,                   
##      baby food,               
##      fruit/vegetable juice,   
##      salty snack,             
##      waffles,                 
##      cake bar,                
##      chocolate,               
##      shopping bags}           
## [2] {frankfurter,             
##      sausage,                 
##      liver loaf,              
##      ham,                     
##      chicken,                 
##      beef,                    
##      citrus fruit,            
##      tropical fruit,          
##      root vegetables,         
##      other vegetables,        
##      whole milk,              
##      butter,                  
##      curd,                    
##      yogurt,                  
##      whipped/sour cream,      
##      beverages,               
##      soft cheese,             
##      hard cheese,             
##      cream cheese ,           
##      mayonnaise,              
##      domestic eggs,           
##      rolls/buns,              
##      roll products ,          
##      flour,                   
##      pasta,                   
##      margarine,               
##      specialty fat,           
##      sugar,                   
##      soups,                   
##      skin care,               
##      hygiene articles,        
##      candles}                 
## [3] {meat,                    
##      turkey,                  
##      tropical fruit,          
##      root vegetables,         
##      onions,                  
##      other vegetables,        
##      whole milk,              
##      butter,                  
##      yogurt,                  
##      whipped/sour cream,      
##      sliced cheese,           
##      hard cheese,             
##      frozen meals,            
##      frozen dessert,          
##      domestic eggs,           
##      white bread,             
##      salt,                    
##      rice,                    
##      oil,                     
##      margarine,               
##      mustard,                 
##      baking powder,           
##      dog food,                
##      bottled water,           
##      fruit/vegetable juice,   
##      rum,                     
##      long life bakery product,
##      chocolate,               
##      cooking chocolate}       
## [4] {beef,                    
##      hamburger meat,          
##      tropical fruit,          
##      root vegetables,         
##      onions,                  
##      other vegetables,        
##      whole milk,              
##      butter,                  
##      yogurt,                  
##      whipped/sour cream,      
##      sliced cheese,           
##      mayonnaise,              
##      frozen vegetables,       
##      frozen potato products,  
##      rolls/buns,              
##      white bread,             
##      rice,                    
##      oil,                     
##      specialty fat,           
##      specialty vegetables,    
##      canned fish,             
##      coffee,                  
##      instant coffee,          
##      bottled beer,            
##      liquor (appetizer),      
##      long life bakery product,
##      chocolate,               
##      napkins,                 
##      house keeping products}  
## [5] {frankfurter,             
##      sausage,                 
##      liver loaf,              
##      beef,                    
##      tropical fruit,          
##      pip fruit,               
##      root vegetables,         
##      onions,                  
##      whole milk,              
##      butter,                  
##      yogurt,                  
##      whipped/sour cream,      
##      sliced cheese,           
##      cream cheese ,           
##      frozen meals,            
##      frozen potato products,  
##      domestic eggs,           
##      rolls/buns,              
##      roll products ,          
##      salt,                    
##      rice,                    
##      margarine,               
##      jam,                     
##      coffee,                  
##      long life bakery product,
##      waffles,                 
##      hygiene articles}        
## [6] {sausage,                 
##      pip fruit,               
##      root vegetables,         
##      onions,                  
##      whole milk,              
##      butter,                  
##      dessert,                 
##      soft cheese,             
##      cream cheese ,           
##      specialty cheese,        
##      domestic eggs,           
##      salt,                    
##      pasta,                   
##      vinegar,                 
##      oil,                     
##      mustard,                 
##      ketchup,                 
##      canned vegetables,       
##      pickled vegetables,      
##      canned fish,             
##      instant coffee,          
##      tea,                     
##      misc. beverages,         
##      bottled beer,            
##      white wine,              
##      chocolate,               
##      abrasive cleaner,        
##      hygiene articles}        
## [7] {sausage,                 
##      pork,                    
##      citrus fruit,            
##      tropical fruit,          
##      pip fruit,               
##      root vegetables,         
##      whole milk,              
##      butter,                  
##      curd,                    
##      yogurt,                  
##      cream,                   
##      sliced cheese,           
##      hard cheese,             
##      frozen vegetables,       
##      frozen fish,             
##      frozen potato products,  
##      brown bread,             
##      margarine,               
##      baking powder,           
##      cat food,                
##      pet care,                
##      soda,                    
##      fruit/vegetable juice,   
##      nut snack,               
##      chocolate,               
##      female sanitary products,
##      hygiene articles,        
##      napkins,                 
##      house keeping products}

To see which items are important in the data set we can use the itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.

#
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)

Use apriori() algorithm to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a confidence of 0.5.

# Run the apriori algorithm
basket_rules <- apriori(Groceries,parameter = list(sup = 0.003, conf = 0.5,target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(basket_rules)
## set of 421 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   5 281 128   7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.325   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence          lift      
##  Min.   :0.003050   Min.   :0.5000   Min.   :1.957  
##  1st Qu.:0.003355   1st Qu.:0.5238   1st Qu.:2.135  
##  Median :0.003965   Median :0.5556   Median :2.426  
##  Mean   :0.004754   Mean   :0.5715   Mean   :2.522  
##  3rd Qu.:0.005186   3rd Qu.:0.6094   3rd Qu.:2.766  
##  Max.   :0.022267   Max.   :0.8857   Max.   :5.804  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.003        0.5

Recall from class:

# Check the generated rules using inspect
inspect(head(basket_rules))
##     lhs                        rhs                support     confidence
## [1] {cereals}               => {whole milk}       0.003660397 0.6428571 
## [2] {specialty cheese}      => {other vegetables} 0.004270463 0.5000000 
## [3] {rice}                  => {other vegetables} 0.003965430 0.5200000 
## [4] {rice}                  => {whole milk}       0.004677173 0.6133333 
## [5] {baking powder}         => {whole milk}       0.009252669 0.5229885 
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246 
##     lift    
## [1] 2.515917
## [2] 2.584078
## [3] 2.687441
## [4] 2.400371
## [5] 2.046793
## [6] 2.846231

As typical for association rule mining, the number of rules found is huge. To analyze these rules, for example, subset() can be used to produce separate subsets of rules. Now find the subset of rules that has 4 or more length (lhs+rhs).

#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
##     lhs                   rhs                    support confidence     lift
## [1] {citrus fruit,                                                          
##      tropical fruit,                                                        
##      root vegetables,                                                       
##      other vegetables} => {whole milk}       0.003152008  0.7045455 2.757344
## [2] {citrus fruit,                                                          
##      tropical fruit,                                                        
##      root vegetables,                                                       
##      whole milk}       => {other vegetables} 0.003152008  0.8857143 4.577509
## [3] {citrus fruit,                                                          
##      tropical fruit,                                                        
##      other vegetables,                                                      
##      whole milk}       => {root vegetables}  0.003152008  0.6326531 5.804238
## [4] {citrus fruit,                                                          
##      root vegetables,                                                       
##      other vegetables,                                                      
##      whole milk}       => {tropical fruit}   0.003152008  0.5438596 5.183004
## [5] {tropical fruit,                                                        
##      root vegetables,                                                       
##      other vegetables,                                                      
##      yogurt}           => {whole milk}       0.003558719  0.7142857 2.795464
## [6] {tropical fruit,                                                        
##      root vegetables,                                                       
##      whole milk,                                                            
##      yogurt}           => {other vegetables} 0.003558719  0.6250000 3.230097
## [7] {tropical fruit,                                                        
##      root vegetables,                                                       
##      other vegetables,                                                      
##      whole milk}       => {yogurt}           0.003558719  0.5072464 3.636128

Find the subset of rules with lift greater than 5:

inspect(subset(basket_rules, lift>5))
##     lhs                   rhs                   support confidence     lift
## [1] {citrus fruit,                                                         
##      tropical fruit,                                                       
##      other vegetables,                                                     
##      whole milk}       => {root vegetables} 0.003152008  0.6326531 5.804238
## [2] {citrus fruit,                                                         
##      root vegetables,                                                      
##      other vegetables,                                                     
##      whole milk}       => {tropical fruit}  0.003152008  0.5438596 5.183004

Now find the subset rules that has Yogurt in the right hand side. Here we require lift measure exceeds 1.2.

yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)

Now inspect the subset rules

inspect(yogurt.rhs)
##     lhs                     rhs          support confidence     lift
## [1] {whipped/sour cream,                                            
##      cream cheese }      => {yogurt} 0.003355363  0.5238095 3.754859
## [2] {root vegetables,                                               
##      cream cheese }      => {yogurt} 0.003762074  0.5000000 3.584184
## [3] {tropical fruit,                                                
##      curd}               => {yogurt} 0.005287239  0.5148515 3.690645
## [4] {other vegetables,                                              
##      whole milk,                                                    
##      cream cheese }      => {yogurt} 0.003457041  0.5151515 3.692795
## [5] {tropical fruit,                                                
##      whole milk,                                                    
##      curd}               => {yogurt} 0.003965430  0.6093750 4.368224
## [6] {tropical fruit,                                                
##      other vegetables,                                              
##      butter}             => {yogurt} 0.003050330  0.5555556 3.982426
## [7] {tropical fruit,                                                
##      whole milk,                                                    
##      butter}             => {yogurt} 0.003355363  0.5409836 3.877969
## [8] {tropical fruit,                                                
##      whole milk,                                                    
##      whipped/sour cream} => {yogurt} 0.004372140  0.5512821 3.951792
## [9] {tropical fruit,                                                
##      root vegetables,                                               
##      other vegetables,                                              
##      whole milk}         => {yogurt} 0.003558719  0.5072464 3.636128

Now find the subset rules that has Meat in the left hand side.

meat.lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>1.5)

Now inspect the subset rules

inspect(meat.lhs)
##     lhs                       rhs          support     confidence lift    
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62       2.426462

We can use the arulesViz package to visualize the rules, for a more complete introduction see (http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf).

install.packages('arulesViz')
library('arulesViz')
plot(basket_rules)

The plot function has an interactive mode for you to inspect individual rules:

plot(basket_rules, interactive=TRUE)

Graph-based visualization can be used for very small sets of rules. The vertices are represented by items for the 10 rules with highest lift:

plot(head(sort(basket_rules, by="lift"), 10), method = "graph")

The package comes with an approach to cluster association rules and itemsets:

plot(basket_rules, method="grouped")

go to top

7 Case Starter Code

For problem 1 Iris data, simply use the Iris dataset in R. When doing cluster analysis for Iris you’ll want to ignore the Species variable.

data(iris)

For problem 2 Cincinnati Zoo data, use the following code to load the transaction data for association rules mining. as() function coerce the dataset into transaction data type for association rules mining.

TransFood <- read.csv('http://homepages.uc.edu/~maifg/DataMining/data/food_4_association.csv')
TransFood <- TransFood[, -1]
TransFood <- as(as.matrix(TransFood), "transactions")

Load the data for clustering:

Food_by_month <- read.csv('http://homepages.uc.edu/~maifg/DataMining/data/qry_Food_by_Month.csv')

go to top